A simple notebook (i.e. no error checking or sensible engineering) to extract the student answer data from a single XML file.
I'll also export the data to a CSV file at the end, so that it's easy to read back in at the beginning of another notebook.
Following discussions with Suraj, we want the representation to take into account the student's response, the official answer, and the grade. So there'll be a little fiddliness in linking the student response back to the gold-standard reference answer.
So, first read the file:
In [3]:
filename='semeval2013-task7/semeval2013-Task7-5way/beetle/train/Core/FaultFinding-BULB_C_VOLTAGE_EXPLAIN_WHY1.xml'
It's an XML file, so we'll need the xml.etree parser, and pandas so that we can import the data into a dataframe:
In [4]:
import pandas as pd
from xml.etree import ElementTree as ET
In [7]:
tree=ET.parse(filename)
r=tree.getroot()
Now, the reference answers are in the second daughter node of the tree. We can extract these and store them in a dictionary. To distinguish between reference answer tokens and student response tokens, I'm going to suffix each token in the reference answers with _RA, and each token in a student response with _SR.
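To make the convention concrete, here's a tiny illustrative example (the word list is made up, not taken from the data); the same surface token becomes two distinct features depending on whether it came from the student or from the reference answer:
# Purely illustrative: the suffix marks where a token came from.
example_tokens = ['the', 'bulb', 'is', 'broken']   # made-up tokens
print([t + '_SR' for t in example_tokens])   # ['the_SR', 'bulb_SR', 'is_SR', 'broken_SR']
print([t + '_RA' for t in example_tokens])   # ['the_RA', 'bulb_RA', 'is_RA', 'broken_RA']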
In [30]:
from string import punctuation
def to_tokens(textIn):
'''Convert the input textIn to a list of tokens'''
tokens_ls=[t.lower().strip(punctuation) for t in textIn.split()]
# remove any empty tokens
return [t for t in tokens_ls if t]
str='"Help!" yelped the banana, who was obviously scared out of his skin.'
print(str)
print(to_tokens(str))
In [50]:
refAnswers_dict={refAnswer.attrib['id']:[t+'_RA' for t in to_tokens(refAnswer.text)]
for refAnswer in r[1]}
refAnswers_dict
Out[50]:
Next, we need to extract each of the student responses. These are in the third daughter node:
In [41]:
print(r[2][0].text)
r[2][0].attrib
Out[41]:
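A quick aside: indexing by position (r[1] for the reference answers, r[2] for the student responses) relies on the order of the elements in the file. If in doubt, it's easy to list the child tags and sizes directly, e.g.:
# Sanity check: the tags of the root's children, and how many items each holds.
print([child.tag for child in r])
print(len(r[1]), len(r[2]))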
In [58]:
responses_ls=[]
for (i, studentResponse) in enumerate(r[2]):
if 'answerMatch' in studentResponse.attrib:
matchTokens_ls=refAnswers_dict[studentResponse.attrib['answerMatch']]
else:
matchTokens_ls=[]
responses_ls.append({'accuracy':studentResponse.attrib['accuracy'],
'text':studentResponse.text,
'tokens':[t+'_SR' for t in to_tokens(studentResponse.text)] + matchTokens_ls})
responses_ls[36]
Out[58]:
OK, that seems to work. Now, let's define a function that takes a filename and returns the list of token dictionaries:
In [66]:
def extract_token_dictionaries(filenameIn):
# Localise the to_tokens function
def to_tokens_local(textIn):
'''Convert the input textIn to a list of tokens'''
tokens_ls=[t.lower().strip(punctuation) for t in textIn.split()]
# remove any empty tokens
return [t for t in tokens_ls if t]
tree=ET.parse(filenameIn)
root=tree.getroot()
refAnswers_dict={refAnswer.attrib['id']:[t+'_RA' for t in to_tokens_local(refAnswer.text)]
for refAnswer in root[1]}
responsesOut_ls=[]
for (i, studentResponse) in enumerate(root[2]):
if 'answerMatch' in studentResponse.attrib:
matchTokens_ls=refAnswers_dict[studentResponse.attrib['answerMatch']]
else:
matchTokens_ls=[]
responsesOut_ls.append({'accuracy':studentResponse.attrib['accuracy'],
'text':studentResponse.text,
'tokens':[t+'_SR' for t in to_tokens_local(studentResponse.text)] \
+ matchTokens_ls})
return responsesOut_ls
We now have a function which takes a filename and returns a list of tokenised student responses and reference answers:
In [68]:
extract_token_dictionaries(filename)[:2]
Out[68]:
So next we need to be able to build a document frequency dictionary from a list of tokenised documents.
In [73]:
def document_frequencies(listOfTokenLists):
# Build the dictionary of all tokens used:
token_set=set()
for tokenList in listOfTokenLists:
token_set=token_set.union(set(tokenList))
# Then return the document frequency counts for each token
return {t:len([l for l in listOfTokenLists if t in l])
for t in token_set}
In [81]:
tokenLists_ls=[x['tokens'] for x in extract_token_dictionaries(filename)]
document_frequencies(tokenLists_ls)
Out[81]:
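As an aside, the same counts can be built in a single pass over the documents rather than rescanning the whole list for every token; here's a sketch using collections.Counter (same output, just a different route, and not used below):
from collections import Counter

def document_frequencies_alt(listOfTokenLists):
    '''Single-pass version of document_frequencies (for comparison only).'''
    df_counter = Counter()
    for tokenList in listOfTokenLists:
        # set() so each token is counted at most once per document
        df_counter.update(set(tokenList))
    return dict(df_counter)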
Next, define a function which takes a list of tokens and a document frequency dictionary, and returns a dictionary of tf.idf values (here computed as term frequency divided by document frequency) for each of the tokens in the list. Note: if a token isn't in the document frequency dictionary, it won't appear in the returned tf.idf dictionary.
We can use the collections.Counter object to get the tf values.
In [82]:
from collections import Counter
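As a quick illustration (the string here is just made up), Counter gives us the raw term frequencies directly:
Counter('the cat sat on the mat'.split())
# -> Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})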
In [86]:
def get_tfidf(tokens_ls, docFreq_dict):
tf_dict=Counter(tokens_ls)
return {t:tf_dict[t]/docFreq_dict[t] for t in tf_dict if t in docFreq_dict}
In [88]:
get_tfidf('the cat sat on the mat'.split(), {'cat':2, 'the':1})
Out[88]:
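One thing worth flagging: the value above is the term frequency divided by the raw document frequency, and that's the weighting used in the rest of this notebook. The more conventional tf.idf uses a log-scaled inverse document frequency; if we ever wanted that instead, a sketch would look like this (num_docs being the total number of documents):
import math

def get_tfidf_log(tokens_ls, docFreq_dict, num_docs):
    '''Conventional tf * log(N/df) variant -- for comparison only, not used below.'''
    tf_dict = Counter(tokens_ls)
    return {t: tf_dict[t] * math.log(num_docs / docFreq_dict[t])
            for t in tf_dict if t in docFreq_dict}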
Finally, we want to convert the outputs for all of the responses into a dataframe.
In [105]:
# Extract the data from the file:
tokenDictionaries_ls=extract_token_dictionaries(filename)
# Build the lists of responses:
tokenLists_ls=[x['tokens'] for x in extract_token_dictionaries(filename)]
# Build the document frequency dict
docFreq_dict=document_frequencies(tokenLists_ls)
# Create the tf.idf for each response:
tfidf_ls=[get_tfidf(tokens_ls, docFreq_dict) for tokens_ls in tokenLists_ls]
# Now, create a dataframe which is indexed by the token dictionary:
trainingText_df=pd.DataFrame(index=docFreq_dict.keys())
# Use the index of responses in the list as column headers:
for (i, tfidf_dict) in enumerate(tfidf_ls):
    trainingText_df[i]=pd.Series(tfidf_dict, index=trainingText_df.index)
# Finally, transpose, and replace the NaNs with 0:
trainingText_df.fillna(0).T
Out[105]:
Cool, that seems to work. Now we just need to do it for the complete set of files. We'll just use beetle/train/Core for the time being.
In [107]:
!ls semeval2013-task7/semeval2013-Task7-5way/beetle/train/Core/
Use os.walk to get the files:
In [114]:
import os
We can now do the same as before, but this time using all the files to construct the final dataframe. We also need a series containing the accuracy measures.
In [137]:
tokenDictionaries_ls=[]
# glob would have been easier... (see the glob sketch after this cell)
for (root, dirs, files) in os.walk('semeval2013-task7/semeval2013-Task7-5way/beetle/train/Core/'):
for filename in files:
if filename[-4:]=='.xml':
tokenDictionaries_ls.extend(extract_token_dictionaries(os.path.join(root, filename)))
# Now we've extracted the information from all the files. We can now construct the dataframe
# in the same way as before:
# Build the lists of responses:
tokenLists_ls=[x['tokens'] for x in tokenDictionaries_ls]
# Build the document frequency dict
docFreq_dict=document_frequencies(tokenLists_ls)
# Now, create a dataframe which is indexed by the tokens
# in the token frequency dictionary:
trainingText_df=pd.DataFrame(index=docFreq_dict.keys())
# Populate the dataframe with the tf.idf for each response. Also,
# create a dictionary of the accuracy values while we're at it.
accuracy_dict={}
for (i, response_dict) in enumerate(tokenDictionaries_ls):
trainingText_df[i]=pd.Series(get_tfidf(response_dict['tokens'], docFreq_dict),
index=trainingText_df.index)
accuracy_dict[i]=response_dict['accuracy']
# Finally, transpose, and replace the NaNs with 0:
trainingText_df=trainingText_df.fillna(0).T
# Also, to make it easier to store in a single csv file, let's put the accuracy
# values in a column (this won't clash with any occurrences of the token "accuracy"
# because we've changed the tokens to "accuracy_SR" and "accuracy_RA"):
trainingText_df['accuracy']=pd.Series(accuracy_dict)
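As the comment in the cell above says, glob would have been easier; for the record, the file-collection step could be written as something like the sketch below, which should pick up the same .xml files.
# A glob version of the file listing above (just a sketch, not run here);
# recursive=True so any subdirectories are also searched, like os.walk.
from glob import glob

xml_paths = glob('semeval2013-task7/semeval2013-Task7-5way/beetle/train/Core/**/*.xml',
                 recursive=True)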
In [138]:
trainingText_df.head()
Out[138]:
And finish by exporting to a CSV file:
In [141]:
trainingText_df.to_csv('beetleTrainingData.csv', index=False)
Done! We can now import the data into a dataframe with:
pd.read_csv('beetleTrainingData.csv')
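In the follow-on notebook, the accuracy column can then be peeled off as the label, leaving the tf.idf columns as the feature matrix; something along these lines (a sketch, assuming the column layout written above):
import pandas as pd

beetle_df = pd.read_csv('beetleTrainingData.csv')
y = beetle_df['accuracy']                  # the grade labels
X = beetle_df.drop(columns=['accuracy'])   # the tf.idf token features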